A typical R project may include the following steps (depending on the purpose):
Step 1: Reading libraries needed for the project
Step 2: Loading dataset
Step 3: Analysing and manipulating the dataset and data visualisation
Step 4: Fitting models and testing goodness of fit
Step 5: Data output
This workshop is to show an example of using R for generalised linear model fitting and to help people get started with R.
This document contains R codes used in this example
Loading the R libraries that are needed for the project
#install and load packages
#install.packages("readxl")
library(readxl)
library(readr)
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Import dataset using readxl and readr. In this example, we would load a dataset from package insuranceData.
For more Singapore Automobile Claims data description:
The data is from the General Insurance Association of Singapore, an organization consisting of general (property and casualty) insurers in Singapore (see the organization’s website: www.gia.org.sg).From this database, several characteristics are available to explain automobile accident frequency.These characteristics include vehicle variables, such as type and age, as well as person level variables, such as age, gender and prior driving experience.
About InsuranceData packages: https://cran.r-project.org/web/packages/insuranceData/insuranceData.pdf
#Inputfile<-"D:/YT_R/HandsOn/SingaporeData.csv"
#InputData<-read_csv(Inputfile)
library(insuranceData)
data(SingaporeAuto)
df <-as.data.frame(SingaporeAuto)
names(df)
## [1] "SexInsured" "Female" "VehicleType" "PC" "Clm_Count"
## [6] "Exp_weights" "LNWEIGHT" "NCD" "AgeCat" "AutoAge0"
## [11] "AutoAge1" "AutoAge2" "AutoAge" "VAgeCat" "VAgecat1"
dim(df)
## [1] 7483 15
colnames(df)
## [1] "SexInsured" "Female" "VehicleType" "PC" "Clm_Count"
## [6] "Exp_weights" "LNWEIGHT" "NCD" "AgeCat" "AutoAge0"
## [11] "AutoAge1" "AutoAge2" "AutoAge" "VAgeCat" "VAgecat1"
head(df,10)
## SexInsured Female VehicleType PC Clm_Count Exp_weights LNWEIGHT NCD
## 1 U 0 T 0 0 0.6680356 -0.40341383 30
## 2 U 0 T 0 0 0.5667351 -0.56786326 30
## 3 U 0 T 0 0 0.5037645 -0.68564629 30
## 4 U 0 T 0 0 0.9144422 -0.08944106 20
## 5 U 0 T 0 0 0.5366188 -0.62246739 20
## 6 U 0 T 0 0 0.7529090 -0.28381095 20
## 7 U 0 M 0 0 0.6707734 -0.39932384 20
## 8 U 0 M 0 0 0.8377823 -0.17699695 20
## 9 U 0 M 0 0 0.1670089 -1.78970819 20
## 10 U 0 M 0 0 0.5037645 -0.68564629 20
## AgeCat AutoAge0 AutoAge1 AutoAge2 AutoAge VAgeCat VAgecat1
## 1 0 0 0 0 0 0 2
## 2 0 0 0 0 0 0 2
## 3 0 0 0 0 0 0 2
## 4 0 0 0 0 0 0 2
## 5 0 0 0 0 0 0 2
## 6 0 0 0 0 0 0 2
## 7 0 0 0 0 0 6 6
## 8 0 0 0 0 0 6 6
## 9 0 0 0 0 0 6 6
## 10 0 0 0 0 0 6 6
summary(df)
## SexInsured Female VehicleType PC
## F: 700 Min. :0.00000 A :3842 Min. :0.0000
## M:3145 1st Qu.:0.00000 G :2882 1st Qu.:0.0000
## U:3638 Median :0.00000 Q : 358 Median :1.0000
## Mean :0.09355 M : 188 Mean :0.5134
## 3rd Qu.:0.00000 P : 88 3rd Qu.:1.0000
## Max. :1.00000 Z : 71 Max. :1.0000
## (Other): 54
## Clm_Count Exp_weights LNWEIGHT NCD
## Min. :0.00000 Min. :0.005476 Min. :-5.2074 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.279261 1st Qu.:-1.2756 1st Qu.: 0.00
## Median :0.00000 Median :0.503764 Median :-0.6856 Median :20.00
## Mean :0.06989 Mean :0.519859 Mean :-0.8945 Mean :19.85
## 3rd Qu.:0.00000 3rd Qu.:0.752909 3rd Qu.:-0.2838 3rd Qu.:30.00
## Max. :3.00000 Max. :1.000000 Max. : 0.0000 Max. :50.00
##
## AgeCat AutoAge0 AutoAge1 AutoAge2
## Min. :0.00 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :2.00 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :1.94 Mean :0.3905 Mean :0.05867 Mean :0.05987
## 3rd Qu.:4.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7.00 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## AutoAge VAgeCat VAgecat1
## Min. :0.000 Min. :0.000 Min. :2.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:2.000
## Median :1.000 Median :1.000 Median :2.000
## Mean :0.509 Mean :2.019 Mean :2.933
## 3rd Qu.:1.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.000 Max. :6.000 Max. :6.000
##
str(df)
## 'data.frame': 7483 obs. of 15 variables:
## $ SexInsured : Factor w/ 3 levels "F","M","U": 3 3 3 3 3 3 3 3 3 3 ...
## $ Female : int 0 0 0 0 0 0 0 0 0 0 ...
## $ VehicleType: Factor w/ 9 levels "A","G","M","P",..: 7 7 7 7 7 7 3 3 3 3 ...
## $ PC : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Clm_Count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Exp_weights: num 0.668 0.567 0.504 0.914 0.537 ...
## $ LNWEIGHT : num -0.4034 -0.5679 -0.6856 -0.0894 -0.6225 ...
## $ NCD : int 30 30 30 20 20 20 20 20 20 20 ...
## $ AgeCat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AutoAge : int 0 0 0 0 0 0 0 0 0 0 ...
## $ VAgeCat : int 0 0 0 0 0 0 6 6 6 6 ...
## $ VAgecat1 : int 2 2 2 2 2 2 6 6 6 6 ...
hist(df$Clm_Count)
table(df$Clm_Count)
##
## 0 1 2 3
## 6996 455 28 4
levels(as.factor(df$SexInsured))
## [1] "F" "M" "U"
Data Analysis - understanding the dataset
#Summary of claim count by chosen factors using dplyr
df%>%
group_by(VehicleType,SexInsured)%>%
summarise(number=n(),
Count=sum(Clm_Count),
Exposure=sum(Exp_weights),
Frequency=Count/Exposure)
## # A tibble: 13 x 6
## # Groups: VehicleType [9]
## VehicleType SexInsured number Count Exposure Frequency
## <fct> <fct> <int> <int> <dbl> <dbl>
## 1 A F 700 50 362. 0.138
## 2 A M 3142 254 1600. 0.159
## 3 G M 1 0 0.167 0
## 4 G U 2881 169 1520. 0.111
## 5 M M 1 0 0.999 0
## 6 M U 187 4 105. 0.0383
## 7 P U 88 11 46.5 0.237
## 8 Q U 358 28 193. 0.145
## 9 S U 16 1 8.64 0.116
## 10 T U 8 0 4.29 0
## 11 W M 1 0 0.914 0
## 12 W U 29 1 12.6 0.0793
## 13 Z U 71 5 36.8 0.136
# Exercise : choose other factors and produce a summary table
Data Cleaning - generating a cleaned dataset
#Summary of claim count by chosen factors using dplyr
df_clean <- df %>%
transmute(SexInsured=factor(SexInsured),
Female=factor(Female),
Private=factor(PC),
NCDCat=factor(NCD),
AgeCat=factor(AgeCat),
VAgeCat=factor(VAgeCat),
Exp_weights,
Clm_Count,
Frequency = Clm_Count/Exp_weights
)
# Exercise : choose other factors and produce a summary table
#Using self-defined functions:
avg <- function(x) {
dat <- aggregate(df$Clm_Count, by = list(df[, x]), FUN = mean)
barplot(dat$x, xlab = x, ylab = "Claim Count Averages")
axis(side=1, at=1:nrow(dat), labels=dat$Group.1)
}
par(mfrow=c(2,2))
avg(x = "AgeCat")
avg(x = "VehicleType")
avg(x = "AutoAge1")
avg(x = "VAgeCat")
avg(x = "NCD")
avg(x = "PC")
avg(x = "SexInsured")
#Using plotly:
df_clean$ID <- seq.int(nrow(df_clean))
y<-sort(df_clean$Frequency,decreasing=FALSE)
plot_ly(df_clean,x=df_clean$ID,y=y,name="test",type='scatter',mode='lines')
#Using ggplot2:
#two-way
ggplot(df_clean, aes(x = SexInsured, fill = NCDCat)) +
geom_bar(position = "fill") +
theme_classic()+theme(axis.text.x = element_text(angle = 90))
# explore frequency by factors
ggplot(data=df_clean, aes(x=AgeCat, y=Clm_Count/Exp_weights, fill=SexInsured)) +
geom_bar(stat="identity", position=position_dodge())
ggplot(data=df_clean, aes(x=AgeCat, y=Clm_Count, fill=SexInsured)) +
geom_bar(stat="identity", position=position_dodge())
ggplot(data=df_clean, aes(x=AgeCat, y=Clm_Count)) +
geom_bar(stat="identity", position=position_dodge())
g <- ggplot(data=df_clean)+scale_y_continuous(labels = scales::percent)
germ_claims <- g+geom_bar(aes_string(x="Clm_Count",y="..prop..",weight="Exp_weights/sum(Exp_weights)"))+ggtitle("Claim counts")+labs(x=NULL, y="% exposure")
germ_claims
#Facet the claims bar chart
germ_claims+facet_grid(.~NCDCat)+ggtitle("Claim counts by NCDCat")
germ_claims+facet_grid(.~AgeCat)+ggtitle("Claim counts by AgeCat")
training <- 0.80
df_training <- sample(1:nrow(df_clean), size = nrow(df_clean) * training, replace = FALSE)
df_train <- df_clean [df_training,]
df_validate <- df_clean [-df_training,]
poisson_intercept <- glm(Clm_Count~1, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_intercept)
##
## Call:
## glm(formula = Clm_Count ~ 1, family = poisson(link = log), data = df_train,
## offset = log(Exp_weights))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5137 -0.4457 -0.3484 -0.2449 3.9932
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.02552 0.04945 -40.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2141.4 on 5985 degrees of freedom
## Residual deviance: 2141.4 on 5985 degrees of freedom
## AIC: 2918.8
##
## Number of Fisher Scoring iterations: 6
paste("Check average model frequency",exp(poisson_intercept$coefficients[["(Intercept)"]]),"vs data average frequency",sum(df_train$Clm_Count)/sum(df_train$Exp_weights))
## [1] "Check average model frequency 0.131925376321422 vs data average frequency 0.13192537632136"
model_full <-glm(Clm_Count ~ .-Exp_weights-Frequency-ID , data =df_train, family=poisson(link=log),offset=log(Exp_weights))
model_stepwise <- step(poisson_intercept, scope=list(lower=poisson_intercept, upper=model_full), direction = "both")
## Start: AIC=2918.83
## Clm_Count ~ 1
##
## Df Deviance AIC
## + VAgeCat 6 2068.9 2858.4
## + NCDCat 5 2118.1 2905.6
## + Private 1 2131.3 2910.8
## + SexInsured 2 2130.3 2911.8
## + AgeCat 6 2123.8 2913.3
## <none> 2141.4 2918.8
## + Female 1 2141.4 2920.8
##
## Step: AIC=2858.37
## Clm_Count ~ VAgeCat
##
## Df Deviance AIC
## + NCDCat 5 2042.7 2842.2
## <none> 2068.9 2858.4
## + Female 1 2067.3 2858.8
## + Private 1 2068.2 2859.6
## + SexInsured 2 2066.8 2860.3
## + AgeCat 6 2060.0 2861.5
## - VAgeCat 6 2141.4 2918.8
##
## Step: AIC=2842.21
## Clm_Count ~ VAgeCat + NCDCat
##
## Df Deviance AIC
## <none> 2042.7 2842.2
## + Female 1 2041.2 2842.7
## + Private 1 2042.7 2844.2
## + SexInsured 2 2041.2 2844.6
## + AgeCat 6 2036.9 2848.4
## - NCDCat 5 2068.9 2858.4
## - VAgeCat 6 2118.1 2905.6
formulas_1<-"Clm_Count ~ SexInsured"
formulas_2<-"Clm_Count ~ NCDCat + VAgeCat"
formulas_3<-"Clm_Count ~ SexInsured+NCDCat+AgeCat+VAgeCat"
poisson_reg1 <- glm(formulas_1, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_reg1)
##
## Call:
## glm(formula = formulas_1, family = poisson(link = log), data = df_train,
## offset = log(Exp_weights))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5606 -0.4318 -0.3434 -0.2368 4.1161
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0397 0.1644 -12.407 <2e-16 ***
## SexInsuredM 0.1892 0.1789 1.058 0.290
## SexInsuredU -0.1561 0.1813 -0.861 0.389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2141.4 on 5985 degrees of freedom
## Residual deviance: 2130.3 on 5983 degrees of freedom
## AIC: 2911.8
##
## Number of Fisher Scoring iterations: 6
poisson_reg2 <- glm(formulas_2, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_reg2)
##
## Call:
## glm(formula = formulas_2, family = poisson(link = log), data = df_train,
## offset = log(Exp_weights))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7621 -0.4265 -0.3116 -0.2001 3.7668
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5710 0.1014 -15.500 < 2e-16 ***
## NCDCat10 -0.3733 0.1442 -2.589 0.00961 **
## NCDCat20 -0.4350 0.1475 -2.950 0.00318 **
## NCDCat30 -0.3535 0.2148 -1.645 0.09992 .
## NCDCat40 -0.7144 0.2723 -2.623 0.00871 **
## NCDCat50 -0.6656 0.1509 -4.411 1.03e-05 ***
## VAgeCat1 0.2209 0.1556 1.420 0.15572
## VAgeCat2 0.3353 0.1534 2.186 0.02885 *
## VAgeCat3 0.1108 0.1670 0.664 0.50678
## VAgeCat4 -0.4739 0.1858 -2.551 0.01075 *
## VAgeCat5 -1.2482 0.2326 -5.366 8.07e-08 ***
## VAgeCat6 -1.5681 0.5842 -2.684 0.00727 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2141.4 on 5985 degrees of freedom
## Residual deviance: 2042.7 on 5974 degrees of freedom
## AIC: 2842.2
##
## Number of Fisher Scoring iterations: 6
poisson_reg3 <- glm(formulas_3, data =df_train, family=poisson(link=log),offset=log(Exp_weights))
summary(poisson_reg3)
##
## Call:
## glm(formula = formulas_3, family = poisson(link = log), data = df_train,
## offset = log(Exp_weights))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8912 -0.4229 -0.3086 -0.1989 3.7722
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.9951 298.1363 -0.040 0.96791
## SexInsuredM 0.2319 0.1795 1.292 0.19643
## SexInsuredU 10.4117 298.1363 0.035 0.97214
## NCDCat10 -0.3815 0.1444 -2.642 0.00823 **
## NCDCat20 -0.4346 0.1476 -2.945 0.00323 **
## NCDCat30 -0.3513 0.2182 -1.610 0.10738
## NCDCat40 -0.6997 0.2749 -2.545 0.01093 *
## NCDCat50 -0.6505 0.1612 -4.035 5.45e-05 ***
## AgeCat2 10.1881 298.1364 0.034 0.97274
## AgeCat3 10.2931 298.1363 0.035 0.97246
## AgeCat4 10.1762 298.1363 0.034 0.97277
## AgeCat5 9.9213 298.1363 0.033 0.97345
## AgeCat6 10.6297 298.1364 0.036 0.97156
## AgeCat7 10.9335 298.1371 0.037 0.97075
## VAgeCat1 0.2344 0.1638 1.431 0.15251
## VAgeCat2 0.3470 0.1596 2.174 0.02968 *
## VAgeCat3 0.1254 0.2223 0.564 0.57257
## VAgeCat4 -0.4589 0.2375 -1.932 0.05332 .
## VAgeCat5 -1.2343 0.2749 -4.490 7.13e-06 ***
## VAgeCat6 -1.5824 0.5954 -2.657 0.00787 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2141.4 on 5985 degrees of freedom
## Residual deviance: 2035.0 on 5966 degrees of freedom
## AIC: 2850.4
##
## Number of Fisher Scoring iterations: 11
# for nested model comparison, hence not use for reg2 and reg1 comparison
anova(poisson_reg1,poisson_reg3,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Clm_Count ~ SexInsured
## Model 2: Clm_Count ~ SexInsured + NCDCat + AgeCat + VAgeCat
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 5983 2130.3
## 2 5966 2035.0 17 95.327 6.48e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
final_model<-poisson_reg3
df_train$predict<-exp(predict(final_model))/df_train$Exp_weights
mean(df_train$predict)
## [1] 0.1317536
mean(df_train$Frequency)
## [1] 0.1293873
sum(df_train$Clm_Count)/sum(df_train$Exp_weights)
## [1] 0.1319254
final_model<-poisson_reg3
df_validate$predict<-exp(predict(final_model,newdata=df_validate))/df_validate$Exp_weights
mean(df_validate$predict)
## [1] 0.1305999
mean(df_validate$Frequency)
## [1] 0.1636985
sum(df_validate$Clm_Count)/sum(df_validate$Exp_weights)
## [1] 0.1443286
#Output to Excel or CSV
#write.csv(check,file="P:/YT_R/HandsOn/output.csv")
write.csv(df_validate,file="/Users/yuantian2019/Downloads/output.csv")
# or write_csv from readr package, which is faster than write.csv